Abstract 

Least squares methods are sophisticated mathematical curve fitting 
procedures used in all classical parametric methods. The linear least squares 
approximation is most often associated with finding the "line of best fit" or the 
regression line. Since all statistical analyses are correlational and all classical 
parametric methods are least squares procedures, it becomes imperative to 
understand just exactly what the least squares procedure is and how it works. 

This paper illustrates the least squares procedure, starting with the simplest 
case of linear least squares with one independent variable and one dependent 
variable and generalizes to n independent variables with vector and matrix 
notation. Graphical representations and small heuristic examples are given. A 
brief generalization to nonlinear least squares is presented. 



The least squares method is a mathematical curve fitting procedure. The 
linear least squares approximation is most often associated with finding the "line 
of best fit" or the regression line. Since all statistical analyses are correlational 
and all classical parametric methods are least squares procedures (Thompson, 
1994), it becomes imperative to understand just exactly what the least squares 
procedure is and how it works. 

The purpose of this paper is to illustrate the least squares procedure, 
starting with the simplest case of linear least squares with one independent 
variable and one dependent variable, and generalizing to cases which better model 
reality. Algebra, matrix algebra and calculus will be employed to some extent, 
with explanation at each step for following the logic. Examples with small data 
sets are given to help make the procedure more concrete. 

Linear Least Squares 

In the simplest case, the linear relationship between one independent 
variable and one dependent variable is considered. Thinking conceptually about 
correlation, the question might be posed, "How well does a line catch all the points 
in a scattergram of data?" Thinking conceptually about the line catching those 
points, the question is, "How can the equation of the line that does the best at 
catching those points be found?" This line is the least squares line, in which the 
sum of the squares of the vertical distances from the data points to the line is 
made as small as possible, or minimized. The line must be the best fit for all the 
points simultaneously. Figure 1 illustrates a line closest to four data points and the 
distances whose sum of squares is to be minimized. 
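
As a rough illustration of the quantity being minimized, the sketch below computes the sum of squared vertical distances for two candidate lines. The four points and the function name `sum_squared_distances` are hypothetical, chosen only for illustration; they are not the points drawn in figure 1.

```python
def sum_squared_distances(points, b, a):
    """Sum of squared vertical distances from each (x, y) to the line y' = b*x + a."""
    return sum((b * x + a - y) ** 2 for x, y in points)


# Four hypothetical data points (assumed for illustration only).
points = [(1, 2), (2, 3), (3, 5), (4, 6)]

# Two candidate lines: the one with the smaller sum catches the points better.
loose = sum_squared_distances(points, b=1.0, a=0.0)
tight = sum_squared_distances(points, b=1.4, a=0.5)
print(loose, tight)
```

Of the two candidates, the second line leaves a much smaller sum, so it is the better fit for these particular points.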



Insert Figure 1 here 
Note that if all the points were to lie exactly on a straight line, any two of those 
points could be used to determine the equation of the line using the point-slope 
form $y - y_1 = b(x - x_1)$, where $(x_1, y_1)$ is any one of the points on the line 
and $b$ is the slope of the line given by $b = \frac{y_2 - y_1}{x_2 - x_1}$, with 
$(x_2, y_2)$ as another point on the line. Since all points almost never lie exactly 
on a straight line, the least squares procedure is invoked to determine the equation 
of the line of best fit. 
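
For the exactly collinear case, a minimal sketch of the two-point computation follows. The function name `line_through` and the sample points are assumptions made for illustration.

```python
from fractions import Fraction


def line_through(p1, p2):
    """Return (b, a) for the line y = b*x + a through two points with different x."""
    (x1, y1), (x2, y2) = p1, p2
    b = Fraction(y2 - y1, x2 - x1)  # slope from the two points
    a = y1 - b * x1                 # point-slope form rearranged to y = b*x + a
    return b, a


b, a = line_through((1, 3), (3, 7))
print(b, a)  # 2 1
```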

To generalize from the distances illustrated in figure 1, suppose that there 
are n data points and that a somewhat linear relationship is expected. The least 
squares line can be found by minimizing the sum 

(1)   $d_1^2 + d_2^2 + \cdots + d_n^2 = \sum_{i=1}^{n} d_i^2$. 

The equation of the regression line is given by $y' = bx + a$, where $b$ is the slope 
of the line and $a$ is the y-intercept. The y-values lying on the regression line 
corresponding to a particular $x_i$ are denoted $y'_i$, since they are the predicted 
values and not the observed y-values for that x. In other words, the $y'$-values do 
not correspond to the points that actually appear in the scattergram, unless the line 
catches them exactly, but are the y-coordinates of the points on the line. Each 
vertical distance is the difference between the $y'$ and the $y$ values. Using the 
substitution $y' = bx + a$, each distance can be written as 

$d_i = y'_i - y_i = bx_i + a - y_i$, for all $1 \le i \le n$. 

The difference can be written as $y'_i - y_i$ or as $y_i - y'_i$ without changing the 
following results. The substitution changes equation (1), the sum to be minimized, 
to 

(1)'  $(bx_1 + a - y_1)^2 + (bx_2 + a - y_2)^2 + \cdots + (bx_n + a - y_n)^2 = \sum_{i=1}^{n} (bx_i + a - y_i)^2$. 

The ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ are all known, since they 
are the actual data points. The unknown values $b$ and $a$ in the equation are left to 
be found. 

A calculus technique will be used to minimize equation (1)'. Since there 
are two unknown variables, b and a, in the equation, partial derivatives will be 
taken and set equal to zero to solve. This technique from calculus will allow us to 
both minimize the one equation and solve for the missing values using linear 
algebra. If the following assumptions are met, then the ordinary least squares 
approximation is the best (most efficient) linear unbiased estimator (BLUE). The 
four assumptions are (1) fixed x-values, (2) homoscedasticity, (3) error terms 
are uncorrelated (no autocorrelation), and (4) error terms have zero mean 
(Hamilton, 1992). 

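For ease of notation, let (1)' $= f$. The partial derivatives of $f$ are 

\[
\begin{aligned}
\frac{\partial f}{\partial b} &= 2(bx_1 + a - y_1)(x_1) + 2(bx_2 + a - y_2)(x_2) + \cdots + 2(bx_n + a - y_n)(x_n) \\
 &= 2\sum_{i=1}^{n} (x_i)(bx_i + a - y_i) \\
\frac{\partial f}{\partial a} &= 2(bx_1 + a - y_1) + 2(bx_2 + a - y_2) + \cdots + 2(bx_n + a - y_n) \\
 &= 2\sum_{i=1}^{n} (bx_i + a - y_i).
\end{aligned}
\]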
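
One way to gain confidence in the two partial derivatives is to compare them with numerical slopes of $f$. The sketch below does this at an arbitrary trial point; the data, the trial values of b and a, and the step size `h` are all hypothetical choices.

```python
def f(points, b, a):
    """Sum of squared distances, equation (1)'."""
    return sum((b * x + a - y) ** 2 for x, y in points)


def partials(points, b, a):
    """Partial derivatives of f with respect to b and a, as derived in the text."""
    df_db = 2 * sum(x * (b * x + a - y) for x, y in points)
    df_da = 2 * sum(b * x + a - y for x, y in points)
    return df_db, df_da


points = [(1, 2), (2, 3), (3, 5), (4, 6)]  # hypothetical data
b, a, h = 1.0, 0.0, 1e-6                   # arbitrary trial point and step size

df_db, df_da = partials(points, b, a)

# Central differences approximate the same slopes numerically.
num_db = (f(points, b + h, a) - f(points, b - h, a)) / (2 * h)
num_da = (f(points, b, a + h) - f(points, b, a - h)) / (2 * h)
print(df_db, num_db)
print(df_da, num_da)
```

Neither derivative is zero at this trial point, which indicates that the line with b = 1 and a = 0 is not the least squares line for these points.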
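Combining like terms and setting the partials equal to zero: 

\[
\frac{\partial f}{\partial b} = 2\left( b\sum_{i=1}^{n} x_i^2 + a\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i y_i \right) = 0
\]

\[
\frac{\partial f}{\partial a} = 2\left( b\sum_{i=1}^{n} x_i + na - \sum_{i=1}^{n} y_i \right) = 0.
\]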
Writing the partials in simplest form yields a system of two equations and two 
unknowns. 

(i)   $b\left(\sum_{i=1}^{n} x_i^2\right) + a\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} x_i y_i$ 

(ii)  $b\left(\sum_{i=1}^{n} x_i\right) + na = \sum_{i=1}^{n} y_i$ 



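Using the method of elimination by addition for solving systems and the 
multiplicative constants $-n \cdot$ (i) and $\left(\sum_{i=1}^{n} x_i\right) \cdot$ (ii), 
one equation is obtained. 

(i)'   $-nb\left(\sum_{i=1}^{n} x_i^2\right) - na\left(\sum_{i=1}^{n} x_i\right) = -n\left(\sum_{i=1}^{n} x_i y_i\right)$ 

(ii)'  $b\left(\sum_{i=1}^{n} x_i\right)^2 + na\left(\sum_{i=1}^{n} x_i\right) = \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$ 

Adding (i)' and (ii)' gives 

\[
b\left(\sum_{i=1}^{n} x_i\right)^2 - nb\left(\sum_{i=1}^{n} x_i^2\right)
 = \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) - n\left(\sum_{i=1}^{n} x_i y_i\right).
\]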



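To solve the resulting equation for b, factor b out of the first two terms and 
divide. 

\[
b\left[ \left(\sum_{i=1}^{n} x_i\right)^2 - n\sum_{i=1}^{n} x_i^2 \right]
 = \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) - n\sum_{i=1}^{n} x_i y_i
\]

(2)   $b = \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) - n\sum_{i=1}^{n} x_i y_i}{\left(\sum_{i=1}^{n} x_i\right)^2 - n\sum_{i=1}^{n} x_i^2}$ 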
Solving (ii) for a yields: 

(3)   $a = \dfrac{\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i}{n}$, where b is as above. 

The linear least squares equation or the regression equation for n data points is 
$y' = bx + a$, where b and a satisfy equations (2) and (3) derived above, and 
yields the line of best fit for a particular data set. Equations (2) and (3) above 
are referred to as the normal equations. As the normal equations suggest, find b 
first, then use (3) to find a. 
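
Equations (2) and (3) translate almost directly into code. The following is a minimal sketch, assuming integer data so that exact fractions can be used; the function name `least_squares_line` and the sample data are hypothetical.

```python
from fractions import Fraction


def least_squares_line(xs, ys):
    """Return (b, a) from equations (2) and (3); assumes integer data, not all x equal."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = Fraction(sum_x * sum_y - n * sum_xy, sum_x ** 2 - n * sum_xx)  # equation (2)
    a = (sum_y - b * sum_x) / n                                        # equation (3)
    return b, a


xs = [1, 2, 3, 4]  # hypothetical data
ys = [2, 3, 5, 6]
b, a = least_squares_line(xs, ys)
print(float(b), float(a))  # 1.4 0.5
```

As in the text, b is found first and then substituted into (3) to obtain a.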

Example 1 

Suppose a teacher is interested in the relationship between the number of 
classes a student has missed over a semester and the grade the student received in 
the class. The data are found in table 1. 



Insert Table 1 here 



The number of absences is represented by the independent variable x and the 
grade for the course is represented by the dependent variable y. Figure 2 

illustrates the scattergram of the data. 



Insert Figure 2 here 



To find the least squares equation for this example, equations (2) and (3) are 
used to calculate b and a. The calculations are presented in table 2. 

Insert Table 2 here 
Substituting from table 2 into equation (2), 

\[
b = \frac{702 - 480}{676 - 1200} = \frac{222}{-524} \approx -.4 .
\]

Since the slope of the line is negative, the graph of the line will fall from left to 
right. The scattergram suggests that a falling line is appropriate. Substituting 
b = -.4 and values from table 2 into equation (3), 

\[
a = \frac{27 + 10.4}{10} = 3.74 .
\]

The y-intercept of the line is 3.74 and the equation of the least squares line or 
regression line is $y' = 3.74 - .4x$. The least squares line is illustrated on the 
scattergram in figure 3. 
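
The sums used in this example can be recomputed from the ordered pairs in table 1. The sketch below does so with the pairs exactly as they are printed there; the variable names are illustrative. Recomputed this way, the sum of the products $x_i y_i$ comes to 50 rather than the 48 listed in table 2, so the slope works out to about -.385 (still -.4 to one decimal place) and the intercept to about 3.70 when the unrounded slope is carried through.

```python
# Ordered pairs (absences, grade points) as printed in table 1.
xs = [2, 2, 5, 2, 3, 0, 3, 8, 0, 1]
ys = [3, 3, 2, 2, 2, 4, 2, 1, 4, 4]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

b = (sum_x * sum_y - n * sum_xy) / (sum_x ** 2 - n * sum_xx)  # equation (2)
a = (sum_y - b * sum_x) / n                                   # equation (3)

print(sum_x, sum_y, sum_xy, sum_xx)
print(round(b, 3), round(a, 3))
```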



Insert Figure 3 here 



The Matrix/Vector Approach 
In order to move to a more general case, the notation of vectors and 
matrices is adopted. The simple case already considered will be adapted to the 
new notation for clear understanding. Taking the general equation for the vertical 
distances $d_i = y'_i - y_i = bx_i + a - y_i$, each distance can be written specifically as 

\[
\begin{array}{l}
a + bx_1 - y_1 \\
a + bx_2 - y_2 \\
\quad \vdots \\
a + bx_n - y_n
\end{array}
\]
This set of equations decomposes into the matrix 
$A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$ 
and the vectors 
$u = \begin{bmatrix} a \\ b \end{bmatrix}$ and 
$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$. 
Then the linear least squares procedure finds u, and therefore b and a, so that 
$|Au - y|$ is minimized. 

Matrices have dimension (row by column). The matrix A has dimension 
n by 2, written n x 2. The column matrix or vector u has dimension 2 x 1 and 
the column matrix or vector y has dimension n x 1. The distance $|Au - y|$ 
suggests the matrix multiplication of A and u. This product matrix can be found 
because the matrices satisfy the linear algebra rule that the number of columns of 
the first matrix must equal the number of rows of the second matrix. The resulting 
product will also be a matrix and will have dimension equal to the number of rows 
of the first matrix by the number of columns of the second matrix. In this case, 
that resulting matrix would have dimension n x 1. In order to subtract two 
matrices, which is the operation that must be performed next, the two matrices 
must have exactly the same dimension. Since the column matrix y is n x 1, it is 
important to the procedure for the resulting product matrix Au to also have 
dimension n x 1, which it does, as shown above. 
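
The dimension bookkeeping can be made concrete with a short sketch. The data, the trial vector u and the helper name `mat_vec` are assumptions for illustration; u is stored in the order (a, b) so that it lines up with the columns of A.

```python
def mat_vec(A, u):
    """Multiply an (n x k) matrix by a (k x 1) column vector, returning n entries."""
    assert all(len(row) == len(u) for row in A), "columns of A must equal rows of u"
    return [sum(a_ij * u_j for a_ij, u_j in zip(row, u)) for row in A]


xs = [1, 2, 3, 4]  # hypothetical data
ys = [2, 3, 5, 6]

A = [[1, x] for x in xs]  # n x 2: a column of ones and a column of x-values
u = [0.5, 1.4]            # 2 x 1: intercept a first, then slope b

Au = mat_vec(A, u)                            # n x 1 vector of predicted values
distances = [p - y for p, y in zip(Au, ys)]   # n x 1 vector Au - y
print(len(A), len(A[0]), len(Au), len(distances))
```

Because Au and y both have n entries, the subtraction is defined entry by entry.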

A more general linear least squares case would involve two independent 
variables and a dependent variable. The data points or ordered triples for n 
observations would look like $(x_{11}, x_{21}, y_1), (x_{12}, x_{22}, y_2), \ldots, (x_{1n}, x_{2n}, y_n)$. The 
researcher might wonder whether the y values are linearly related to the x 
values. Each distance would be written 

\[
\begin{array}{l}
b_0 + b_1 x_{11} + b_2 x_{21} - y_1 \\
b_0 + b_1 x_{12} + b_2 x_{22} - y_2 \\
\quad \vdots \\
b_0 + b_1 x_{1n} + b_2 x_{2n} - y_n
\end{array}
\]
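The matrix and vectors for this case generalize to 

\[
A = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}_{(n \times 3)}
\qquad
B = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}_{(3 \times 1)}
\qquad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{(n \times 1)}
\]

The linear least squares procedure will then determine the vector B (and hence 
$b_0$, $b_1$ and $b_2$) by again minimizing the difference $|AB - Y|$. Note that the 
multiplication can be performed since A is (n x 3) and B is (3 x 1). The 
resulting product matrix has dimension (n x 1). 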

The extension to the most general linear least squares case of one 
dependent and k independent variables follows the same notation and will offer 
the generalized normal equation(s). The linear least squares approximation now 
has the form 

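\[
y'_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} .
\]

The distances are now 

\[
\begin{array}{l}
b_0 + b_1 x_{11} + b_2 x_{21} + \cdots + b_k x_{k1} - y_1 \\
b_0 + b_1 x_{12} + b_2 x_{22} + \cdots + b_k x_{k2} - y_2 \\
\quad \vdots \\
b_0 + b_1 x_{1n} + b_2 x_{2n} + \cdots + b_k x_{kn} - y_n
\end{array}
\]

with the following matrix decomposition, 

\[
A = \begin{bmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{bmatrix}_{(n \times (k+1))}
\qquad
B = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}_{((k+1) \times 1)}
\qquad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{(n \times 1)}
\]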

Checking dimensions for multiplication of the matrices yields AB as an (n x 1) 
product matrix. The linear least squares procedure will minimize the difference 
$|AB - Y|$. If the columns of A are linearly independent, then the difference 
$(y - y')$ is uniquely determined, $A'A$ is positive definite (and therefore 
nonsingular) (Seber, 1977), $(A'A)^{-1}$ exists and 

(*)   $B = (A'A)^{-1}A'Y$, 

where $A'$ denotes the transpose of A. The general normal equation(s) above 
minimizes the difference $|AB - Y|$. An in-depth discussion of terms used in the 
above reference and the derivation of (*) are beyond the scope of this paper. 
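
A minimal sketch of how B might be computed for any number of independent variables follows. Rather than forming $(A'A)^{-1}$ explicitly as in (*), it solves the equivalent system $A'AB = A'Y$ by elimination with exact fractions, which gives the same B when the columns of A are linearly independent. The helper names and the data (generated from $y = 1 + 2x_1 + 3x_2$) are hypothetical.

```python
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Matrix product: each row of P is paired with each column of Q."""
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination (M assumed nonsingular)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[-1] for row in aug]


def least_squares(A, Y):
    """Return B satisfying the normal equations A'A B = A'Y."""
    At = transpose(A)
    AtA = matmul(At, A)
    AtY = [sum(a * y for a, y in zip(row, Y)) for row in At]
    return solve(AtA, AtY)


# Hypothetical data: a column of ones, then x1 and x2; Y follows 1 + 2*x1 + 3*x2.
A = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
Y = [1, 3, 4, 6, 8]
B = least_squares(A, Y)
print([float(v) for v in B])  # [1.0, 2.0, 3.0]
```

Because these data lie exactly on a plane, the procedure recovers the generating constants and every distance is zero.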

Example 2 

Suppose a teacher is interested in whether math SAT scores, scores on a 
high school math achievement test, and the grade received in a first college math 
class are linearly related to college grade point averages for mathematically gifted 
students. The data for 10 hypothetical students are found in table 3. 



Insert Table 3 here 



The independent variable $X_1$ contains the ten math SAT scores, the independent 
variable $X_2$ contains the ten high school math achievement test scores, the 
independent variable $X_3$ contains the ten first math course grades and the student 
grade point averages are represented by the dependent variable Y. The matrix 
representation for this example is 

\[
A = \begin{bmatrix}
1 & 670 & 10 & 4 \\
1 & 630 & 9 & 4 \\
1 & 610 & 6 & 1 \\
1 & 570 & 9 & 3 \\
1 & 700 & 8 & 3 \\
1 & 640 & 6 & 2 \\
1 & 630 & 7 & 2 \\
1 & 610 & 8 & 3 \\
1 & 570 & 10 & 4 \\
1 & 550 & 8 & 2
\end{bmatrix}
\qquad
B = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
\qquad
Y = \begin{bmatrix}
3.82 \\ 3.74 \\ 1.84 \\ 3.34 \\ 3.26 \\ 2.35 \\ 3.03 \\ 3.05 \\ 3.76 \\ 2.72
\end{bmatrix}
\]




where B is the matrix of constants to be found by the linear least squares 
procedure. The difference to be minimized is 

\[
|AB - Y| = \left| A_{(10 \times 4)} B_{(4 \times 1)} - Y_{(10 \times 1)} \right| .
\]

The column matrix or vector B is found by using equation (*). Since the 
inverse of $A'A$ must be found, the computer algebra system MAPLE will be 
used for computational ease. The inverse of a (2 x 2) or (3 x 3) matrix 
can easily be computed by hand, but since $A'A$ has dimension (4 x 4) the 
computations are best left to the computer. As equation (*) is implemented, the 
step by step matrix products will be given. 

Equation (*) requires first that the transpose of A be taken. The 
transpose of any matrix is found by switching the rows and the columns. Let the 
columns of A be the rows of A' and the rows of A be the columns of A'. 
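Then, 

\[
A' = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
670 & 630 & 610 & 570 & 700 & 640 & 630 & 610 & 570 & 550 \\
10 & 9 & 6 & 9 & 8 & 6 & 7 & 8 & 10 & 8 \\
4 & 4 & 1 & 3 & 3 & 2 & 2 & 3 & 4 & 2
\end{bmatrix}
\]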

Next, the product A' A must be found. Note that since A' has dimension (4 x 
10) and A has dimension (10 x 4), the product can be formed and the resulting 
matrix will have dimension (4 x 4). To perform matrix multiplication, "pour" the 
rows of the first matrix down the columns of the second, multiply like entries and 
add for a total entry. 

\[
A'A = \begin{bmatrix}
10 & 6{,}180 & 81 & 28 \\
6{,}180 & 3{,}838{,}800 & 49{,}990 & 17{,}370 \\
81 & 49{,}990 & 675 & 239 \\
28 & 17{,}370 & 239 & 88
\end{bmatrix}
\]
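
The transpose and the product $A'A$ can be reproduced with a few lines of code. This is only a sketch of the "pour the rows down the columns" rule using the data of table 3; the variable names are illustrative.

```python
# Rows of table 3: (math SAT, HS math achievement, first math course grade points).
rows = [
    (670, 10, 4), (630, 9, 4), (610, 6, 1), (570, 9, 3), (700, 8, 3),
    (640, 6, 2), (630, 7, 2), (610, 8, 3), (570, 10, 4), (550, 8, 2),
]
A = [[1, sat, hs, grade] for sat, hs, grade in rows]  # 10 x 4

# Transpose: the columns of A become the rows of A'.
At = [list(col) for col in zip(*A)]                   # 4 x 10

# Entry (i, j) of A'A pairs row i of A' with column j of A,
# and column j of A holds the same numbers as row j of A'.
AtA = [[sum(p * q for p, q in zip(r, c)) for c in At] for r in At]

for row in AtA:
    print(row)
```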

The inverse of the above (4 x 4) matrix must be found. Recall that the inverse 
must exist because the columns of the product matrix are linearly independent and 
the matrix itself is positive definite. MAPLE finds 



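\[
(A'A)^{-1} = \begin{bmatrix}
\frac{2{,}376{,}803}{40{,}143} & \frac{-26{,}141}{401{,}430} & \frac{-56{,}867}{13{,}381} & \frac{223{,}069}{40{,}143} \\[4pt]
\frac{-26{,}141}{401{,}430} & \frac{163}{2{,}007{,}150} & \frac{243}{66{,}905} & \frac{-2{,}077}{401{,}430} \\[4pt]
\frac{-56{,}867}{13{,}381} & \frac{243}{66{,}905} & \frac{6{,}114}{13{,}381} & \frac{-8{,}104}{13{,}381} \\[4pt]
\frac{223{,}069}{40{,}143} & \frac{-2{,}077}{401{,}430} & \frac{-8{,}104}{13{,}381} & \frac{36{,}506}{40{,}143}
\end{bmatrix}
\]

To find the inverse of a matrix by hand, augment the matrix with the identity 
matrix of the same dimension and perform Gauss-Jordan elimination. This 
process can get messy as demonstrated by the fractional entries above. 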
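
The by-hand procedure can be sketched in code: augment the matrix with the identity, row-reduce the left half to the identity, and read the inverse off the right half. Exact fractions are used so the entries stay in rational form. The function name `inverse` and the small (2 x 2) test matrix are assumptions for illustration.

```python
from fractions import Fraction


def inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination on [M | I]."""
    n = len(M)
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(M)
    ]
    for c in range(n):
        # Find a row with a nonzero entry in column c (fails if M is singular).
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]          # scale the pivot row to a leading 1
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


small = [[2, 1], [5, 3]]
small_inv = inverse(small)
print(small_inv == [[3, -1], [-5, 2]])  # True

# The same routine applied to the (4 x 4) product matrix of example 2.
AtA = [
    [10, 6180, 81, 28],
    [6180, 3838800, 49990, 17370],
    [81, 49990, 675, 239],
    [28, 17370, 239, 88],
]
AtA_inv = inverse(AtA)
```

A quick check on any computed inverse is to multiply it by the original matrix and confirm that the identity matrix results.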

The first half of equation (*) has now been found. To proceed, the 
product of the inverse matrix and A' is found. The product will combine a (4 x 
4) matrix and a (4 x 10) matrix, to form a (4 x 10) matrix. Even simple 
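operations such as matrix multiplication can become complicated. The computer 
algebra system MAPLE computed the following product matrix: 

\[
(A'A)^{-1}A' = \begin{bmatrix}
\frac{-188{,}378}{40{,}143} & \frac{28{,}929}{13{,}381} & \frac{-18{,}335}{40{,}143} & \frac{20{,}564}{40{,}143} & \frac{-49{,}556}{13{,}381} & \frac{126{,}311}{40{,}143} & \frac{-18{,}149}{40{,}143} & \frac{28{,}867}{13{,}381} & \frac{24{,}344}{13{,}381} & \frac{20{,}378}{40{,}143} \\[4pt]
\frac{1{,}973}{401{,}430} & \frac{-263}{133{,}810} & \frac{208}{200{,}715} & \frac{-334}{200{,}715} & \frac{352}{66{,}905} & \frac{-683}{401{,}430} & \frac{449}{401{,}430} & \frac{-137}{66{,}905} & \frac{-429}{133{,}810} & \frac{-701}{401{,}430} \\[4pt]
\frac{4{,}419}{13{,}381} & \frac{-3{,}639}{13{,}381} & \frac{1{,}359}{13{,}381} & \frac{1{,}549}{13{,}381} & \frac{1{,}753}{13{,}381} & \frac{-5{,}287}{13{,}381} & \frac{341}{13{,}381} & \frac{-2{,}621}{13{,}381} & \frac{-441}{13{,}381} & \frac{2{,}567}{13{,}381} \\[4pt]
\frac{-13{,}186}{40{,}143} & \frac{6{,}478}{13{,}381} & \frac{-12{,}994}{40{,}143} & \frac{-4{,}610}{40{,}143} & \frac{-2{,}433}{13{,}381} & \frac{17{,}281}{40{,}143} & \frac{-4{,}954}{40{,}143} & \frac{3{,}798}{13{,}381} & \frac{2{,}528}{13{,}381} & \frac{-12{,}650}{40{,}143}
\end{bmatrix}
\]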



The last product to be formed is that of the above matrix with the matrix Y. The 
result will be the matrix product $(A'A)^{-1}A'Y$, which contains the unknown 
constants in the column matrix B and the solutions to the linear least squares 
normal equation(s). 

\[
B = \begin{bmatrix} -.21747 \\ .001265 \\ .194196 \\ .340553 \end{bmatrix}
\]

Thus, the linear least squares equation for example 2 is 

\[
y' = -.21747 + .001265x_1 + .194196x_2 + .340553x_3 .
\]
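
To see how the fitted equation is used, the sketch below evaluates it for each student in table 3 and lists the vertical distance between the predicted and observed grade point averages. The coefficients are the rounded values reported above; the function name `predict` is an assumption for illustration.

```python
# Rounded coefficients b0, b1, b2, b3 as reported for example 2.
B = [-0.21747, 0.001265, 0.194196, 0.340553]


def predict(sat, hs, grade):
    """Predicted grade point average y' for one student."""
    return B[0] + B[1] * sat + B[2] * hs + B[3] * grade


# (math SAT, HS math achievement, first math course grade points, observed GPA)
students = [
    (670, 10, 4, 3.82), (630, 9, 4, 3.74), (610, 6, 1, 1.84), (570, 9, 3, 3.34),
    (700, 8, 3, 3.26), (640, 6, 2, 2.35), (630, 7, 2, 3.03), (610, 8, 3, 3.05),
    (570, 10, 4, 3.76), (550, 8, 2, 2.72),
]

predicted = [predict(sat, hs, grade) for sat, hs, grade, _ in students]
distances = [p - gpa for p, (_, _, _, gpa) in zip(predicted, students)]

for number, (p, d) in enumerate(zip(predicted, distances), start=1):
    print(number, round(p, 2), round(d, 2))
```

For student 1, for example, the predicted value is about 3.93 against an observed 3.82.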



Nonlinear least squares 

A brief discussion, without examples, of quadratic least squares minimum 
differences and general least squares minimum differences will be related to the 
linear least squares discussion above. 

If data represented by the ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ have been 
collected and the researcher expects a quadratic relationship between the x and 
y values, the distances are represented as 

\[
\begin{array}{l}
b_0 + b_1 x_1 + b_2 x_1^2 - y_1 \\
b_0 + b_1 x_2 + b_2 x_2^2 - y_2 \\
\quad \vdots \\
b_0 + b_1 x_n + b_2 x_n^2 - y_n
\end{array}
\]

Consequently, the matrix decomposition looks like: 

\[
A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}
\qquad
B = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
\qquad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]

The difference to be minimized is the same $|AB - Y|$, where B is found through 
the least squares procedure. 
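
The only change from the linear case is the extra column of squared x-values in A. A minimal sketch with hypothetical data generated from $y = 3 - 2x + x^2$ follows; the function name `quadratic_design_matrix` is an assumption for illustration.

```python
def quadratic_design_matrix(xs):
    """Rows of A for the quadratic case: 1, x and x squared."""
    return [[1, x, x * x] for x in xs]


xs = [-1, 0, 1, 2]   # hypothetical x-values
ys = [6, 3, 2, 3]    # generated from y = 3 - 2x + x^2
B = [3, -2, 1]       # b0, b1, b2 of the generating quadratic

A = quadratic_design_matrix(xs)
distances = [sum(a * b for a, b in zip(row, B)) - y for row, y in zip(A, ys)]
print(A)
print(distances)  # [0, 0, 0, 0]
```

The model is still linear in the unknown constants, which is why the same matrix procedure applies.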

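In the general setting, for data given by tuples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, 
the researcher might expect the x and y values to be related by 
$y \approx b_1 f_1(x) + b_2 f_2(x) + \cdots + b_m f_m(x)$, where the constants $b_i$ are to be determined 
and the functions $f_i(x)$ are the expected relationship between x and y values. 
The system of distances takes the form 

\[
\begin{array}{l}
b_1 f_1(x_1) + b_2 f_2(x_1) + \cdots + b_m f_m(x_1) - y_1 \\
b_1 f_1(x_2) + b_2 f_2(x_2) + \cdots + b_m f_m(x_2) - y_2 \\
\quad \vdots \\
b_1 f_1(x_n) + b_2 f_2(x_n) + \cdots + b_m f_m(x_n) - y_n
\end{array}
\]

Set 

\[
A = \begin{bmatrix}
f_1(x_1) & f_2(x_1) & \cdots & f_m(x_1) \\
f_1(x_2) & f_2(x_2) & \cdots & f_m(x_2) \\
\vdots & \vdots & & \vdots \\
f_1(x_n) & f_2(x_n) & \cdots & f_m(x_n)
\end{bmatrix}
\qquad
B = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
\qquad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]

and again minimize $|AB - Y|$. 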
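
A short sketch of the general setting follows, in which the columns of A are built from an arbitrary list of functions. The function name `fit_basis`, the choice of $f_1(x) = 1$ and $f_2(x) = 2^x$, and the data (generated from $y = 5 - 2 \cdot 2^x$) are all hypothetical. The constants are found by solving $A'AB = A'Y$ by elimination, which assumes the columns of A are linearly independent.

```python
from fractions import Fraction


def fit_basis(basis, xs, ys):
    """Least squares constants b_1..b_m for y ~ b_1*f_1(x) + ... + b_m*f_m(x)."""
    A = [[Fraction(f(x)) for f in basis] for x in xs]
    m = len(basis)
    # Augmented normal equations [A'A | A'Y].
    aug = [
        [sum(row[i] * row[j] for row in A) for j in range(m)]
        + [sum(row[i] * Fraction(y) for row, y in zip(A, ys))]
        for i in range(m)
    ]
    for c in range(m):
        pivot = next(r for r in range(c, m) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(m):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[-1] for row in aug]


basis = [lambda x: 1, lambda x: 2 ** x]   # f_1 and f_2, chosen for illustration
xs = [0, 1, 2, 3]
ys = [3, 1, -3, -11]                      # generated from y = 5 - 2 * 2**x
b = fit_basis(basis, xs, ys)
print([float(v) for v in b])  # [5.0, -2.0]
```

The linear and quadratic cases correspond to the basis lists (1, x) and (1, x, x^2) respectively.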

Least squares procedures are sophisticated mathematical curve fitting 
procedures used in all classical parametric methods (Thompson, 1994). Calculus 
and linear algebra tools are implemented throughout the procedure. For large 
amounts of data, computer algebra systems such as MAPLE, or statistics 
packages, are useful for carrying out computations. Small data sets are helpful in 
making the linear least squares procedure concrete and enabling computations to 
be done by hand. 
Table 1 
Data for example 1 

Student   Number of absences   Grade in the class   Ordered pair 
          (x_i)                                     representation 
1         2                    B                    (2,3) 
2         2                    B                    (2,3) 
3         5                    C                    (5,2) 
4         2                    C                    (2,2) 
5         3                    C                    (3,2) 
6         0                    A                    (0,4) 
7         3                    C                    (3,2) 
8         8                    D                    (8,1) 
9         0                    A                    (0,4) 
10        1                    A                    (1,4) 

Note: A = 4.0, B = 3.0, C = 2.0, D = 1.0, F = 0.0 



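Table 2 
Computations for the formula to find b for example 1 

Quantity                                                        Value 
$\sum_{i=1}^{n} x_i$                                            26 
$\left(\sum_{i=1}^{n} x_i\right)^2$                             676 
$\sum_{i=1}^{n} y_i$                                            27 
$\sum_{i=1}^{n} x_i y_i$                                        48 
$\sum_{i=1}^{n} x_i^2$                                          120 
$\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i$                         702 
$n\sum_{i=1}^{n} x_i y_i$                                       480 
$n\sum_{i=1}^{n} x_i^2$                                         1200 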
Table 3 
Data for example 2 

Student   Math SAT   HS Math             First Math     College Grade   Ordered 
          score      Achievement Score   Course Grade   Point Average   4-tuples 
1         670        10                  A              3.82            (670,10,4,3.82) 
2         630        9                   A              3.74            (630,9,4,3.74) 
3         610        6                   D              1.84            (610,6,1,1.84) 
4         570        9                   B              3.34            (570,9,3,3.34) 
5         700        8                   B              3.26            (700,8,3,3.26) 
6         640        6                   C              2.35            (640,6,2,2.35) 
7         630        7                   C              3.03            (630,7,2,3.03) 
8         610        8                   B              3.05            (610,8,3,3.05) 
9         570        10                  A              3.76            (570,10,4,3.76) 
10        550        8                   C              2.72            (550,8,2,2.72) 

Note: A = 4.0, B = 3.0, C = 2.0, D = 1.0, F = 0.0 



Figure 3 

Least Squares Line for data in example 1 
APPENDIX 
MAPLE Commands 

>with(linalg): 
>A:=matrix(#rows,#cols,[row entries separated by commas]); 
>AT:=transpose(A); 
>innerprod(A,AT); 
>ATA:=innerprod(AT,A); 
>ATAINV:=inverse(ATA); 
>INVAT:=innerprod(ATAINV,AT); 
>Y:=matrix(#rows,#cols,[row entries separated by commas]); 
>B:=innerprod(INVAT,Y); 